[Pipelines] Add DreamLite text-to-image and image-edit pipelines #13815
Carlofkl wants to merge 6 commits
Add ByteDance's DreamLite model family to diffusers. DreamLite is a
UNet-based diffusion model that supports both text-to-image generation
and reference-image editing through a shared 3-branch dual-CFG design.
Two pipelines are shipped:
* DreamLitePipeline - full 3-branch dual CFG (negative, reference,
  prompt); supports T2I and I2I editing at 1024x1024.
* DreamLiteMobilePipeline - distilled single-branch variant for
  on-device inference; no CFG.
New model code (all isolated under *_dreamlite.py / unet_dreamlite.py
to avoid touching shared upstream files):
* models/transformers/transformer_2d_dreamlite.py - DreamLite 2D
  transformer block.
* models/unets/unet_dreamlite.py - DreamLiteUNetModel.
* models/unets/unet_2d_blocks_dreamlite.py - DreamLite-specific
  down/up/mid blocks.
* models/resnet_dreamlite.py - DreamLite ResNet variants.
* models/attention_processor.py - add DreamLiteAttnProcessor2_0 (pure
  addition, no existing processor modified).
Pipeline + tests + docs:
* pipelines/dreamlite/{__init__.py, pipeline_dreamlite.py,
  pipeline_dreamlite_mobile.py, pipeline_output.py}.
* tests/pipelines/dreamlite/{test_pipeline_dreamlite.py,
  test_pipeline_dreamlite_mobile.py} with the standard
  PipelineTesterMixin suite; setUp/tearDown auto-patches encode_prompt
  with a fake so MagicMock text encoders work without per-test
  boilerplate.
* Skip 8 mixin tests that don't apply to DreamLite (MagicMock
  serialisation, custom attention processor, encode_prompt return
  shape, batch_size > 1 sweep), mirroring SD3 / Flux conventions.
* docs/source/en/api/pipelines/dreamlite.md + _toctree.yml entry
  (alphabetically between DiT and EasyAnimate).
* Register exports in 6 __init__.py files.
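The setUp/tearDown auto-patching mentioned in the list above can be sketched with the standard library alone. This is only an illustration of the pattern, not the PR's test code: `ToyPipeline`, `fake_encode_prompt`, and the `(embeddings, mask)` return shape are hypothetical stand-ins.

```python
import unittest
from unittest import mock


class ToyPipeline:
    """Hypothetical stand-in for a pipeline whose encode_prompt needs a real text encoder."""

    def encode_prompt(self, prompt):
        raise RuntimeError("needs a real text encoder")

    def __call__(self, prompt):
        embeds, mask = self.encode_prompt(prompt)
        return len(embeds), len(mask)


def fake_encode_prompt(self, prompt):
    # Assumed return shape: (embeddings, attention mask), one entry per word.
    tokens = prompt.split()
    return [0.0] * len(tokens), [1] * len(tokens)


class ToyPipelineTests(unittest.TestCase):
    def setUp(self):
        # Patch once per test so individual tests need no mocking boilerplate.
        self._patcher = mock.patch.object(ToyPipeline, "encode_prompt", fake_encode_prompt)
        self._patcher.start()

    def tearDown(self):
        # Always restore the real method so other test classes are unaffected.
        self._patcher.stop()

    def test_call_uses_fake(self):
        self.assertEqual(ToyPipeline()("a red fox"), (3, 3))
```

Starting the patcher in setUp and stopping it in tearDown keeps the fake scoped to each test, which is why no per-test boilerplate is needed.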
Two real bugs surfaced by the mixin test suite are fixed in this
commit:
* num_images_per_prompt > 1: prompt_embeds and text_attention_mask
  are now repeated along the batch dimension in both pipelines'
  T2I and I2I branches before being passed to the UNet.
* vae=None: __init__ now guards the encoder_block_out_channels
  lookup so encode_prompt can be tested in isolation per
  PipelineTesterMixin convention.
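For the first fix, the batch expansion can be illustrated without torch. The sketch below is a pure-Python analogue of `tensor.repeat_interleave(n, dim=0)`, which is the call the pipeline code uses; the helper name and the toy values are hypothetical.

```python
def repeat_interleave_batch(rows, num_images_per_prompt):
    """Pure-Python analogue of tensor.repeat_interleave(n, dim=0).

    Copies of the same row stay adjacent, so every image generated for
    prompt i sees the embeddings and mask that belong to prompt i.
    """
    return [row for row in rows for _ in range(num_images_per_prompt)]


prompt_embeds = [["cat-emb"], ["dog-emb"]]    # toy batch of 2 prompts
text_attention_mask = [[1, 1, 0], [1, 1, 1]]  # one mask row per prompt

expanded_embeds = repeat_interleave_batch(prompt_embeds, 2)
expanded_mask = repeat_interleave_batch(text_attention_mask, 2)
```

Both sequences must be expanded the same way; expanding only the embeddings would leave the mask with a mismatched batch size.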
SlowTests real-checkpoint resolution is set to 1024x1024 (the only
size DreamLite is trained for).
Test result: 27 passed, 50 skipped, 0 failed on CPU fast suite.
make style && make quality: clean.
The `carlofkl/DreamLite-{base,mobile}` Hub repos host two flavours of the
same checkpoint:
* `main` branch - keeps `model_index.json` pointing at ByteDance's
  internal package path so the original (non-diffusers) reference code
  can still load these weights.
* `diffusers` branch - rewrites the `unet` entry of `model_index.json`
  to `["diffusers", "DreamLiteUNetModel"]` so this integration loads
  correctly from `diffusers`.
This commit pins every `from_pretrained(...)` call shipped with the
diffusers integration (docs examples, pipeline docstrings, SlowTests) to
`revision="diffusers"`. Local-override env vars (DREAMLITE_BASE_PATH /
DREAMLITE_MOBILE_PATH) still bypass the revision pin.
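One way the revision pin and the local-override env vars could interact is sketched below. `resolve_checkpoint` is a hypothetical helper, not a function from this PR; only the env var names, the repo ids, and `revision="diffusers"` come from the commit message.

```python
import os


def resolve_checkpoint(repo_id, env_var, revision="diffusers"):
    """Return keyword arguments for a from_pretrained-style call (hypothetical helper)."""
    local_path = os.environ.get(env_var)
    if local_path:
        # A local directory has no Hub branches, so the revision pin is skipped.
        return {"pretrained_model_name_or_path": local_path}
    # Hub load: pin to the branch whose model_index.json points at diffusers classes.
    return {"pretrained_model_name_or_path": repo_id, "revision": revision}


kwargs = resolve_checkpoint("carlofkl/DreamLite-base", "DREAMLITE_BASE_PATH")
```

The resulting dictionary could then be unpacked into a call such as `DreamLitePipeline.from_pretrained(**kwargs)`, assuming the pipeline class from this PR is available.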
…ts after rebase
Mechanical changes after rebasing onto current `main`:
* `pipeline_dreamlite.py::retrieve_timesteps`: re-synced from
  `diffusers.pipelines.flux.pipeline_flux.retrieve_timesteps` (PEP 604
  type hints, expanded docstring, plus the new `accepts_timesteps` /
  `accept_sigmas` introspection guards). DreamLite's default code path
  uses `num_inference_steps` (uniform schedule) and never passes custom
  `timesteps` / `sigmas`, so the added guards are dead code for this
  pipeline and behaviour is unchanged.
* `dummy_pt_objects.py` / `dummy_torch_and_transformers_objects.py`:
  registered the dummy classes for `DreamLiteTransformer2DModel`,
  `DreamLiteUNetModel`, `DreamLitePipeline`, `DreamLiteMobilePipeline`,
  `DreamLitePipelineOutput`. Generated by `make fix-copies`. No hand
  edits.
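The introspection guards mentioned above follow a common pattern: inspect the scheduler's `set_timesteps` signature before forwarding a custom schedule. The sketch below is a simplified illustration of that pattern, not the synced diffusers function; `UniformScheduler` and `retrieve_timesteps_sketch` are hypothetical names.

```python
import inspect


class UniformScheduler:
    """Toy scheduler whose set_timesteps only accepts a step count (hypothetical)."""

    def set_timesteps(self, num_inference_steps, device=None):
        self.timesteps = list(range(num_inference_steps - 1, -1, -1))


def retrieve_timesteps_sketch(scheduler, num_inference_steps=None, timesteps=None, sigmas=None):
    if timesteps is not None and sigmas is not None:
        raise ValueError("Pass only one of `timesteps` or `sigmas`.")
    params = inspect.signature(scheduler.set_timesteps).parameters
    if timesteps is not None:
        accepts_timesteps = "timesteps" in params
        if not accepts_timesteps:
            raise ValueError("This scheduler does not support custom timesteps.")
        scheduler.set_timesteps(timesteps=timesteps)
    elif sigmas is not None:
        accept_sigmas = "sigmas" in params
        if not accept_sigmas:
            raise ValueError("This scheduler does not support custom sigmas.")
        scheduler.set_timesteps(sigmas=sigmas)
    else:
        # The only branch reached when just num_inference_steps is passed.
        scheduler.set_timesteps(num_inference_steps)
    return scheduler.timesteps, len(scheduler.timesteps)


timesteps, num_steps = retrieve_timesteps_sketch(UniformScheduler(), num_inference_steps=4)
```

When neither `timesteps` nor `sigmas` is supplied, both guarded branches are skipped, which matches the commit's claim that they are dead code on the default path.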
…ing entries
- Register DreamLiteAttnProcessor2_0 in docs/source/en/api/attnprocessor.md (fixes check_support_list.py).
- Split combined 'height / width' and 'guidance_scale / image_guidance_scale' entries in the two pipeline docstrings; add a complete Args block to DreamLiteTransformer2DModel.forward (fixes check_forward_call_docstrings.py).
No behavioral change.
Hi @sayakpaul @yiyixuxu, I pushed a small follow-up commit. No behavioral change: docs/docstrings only. Verified both lints pass locally. Whenever convenient, could you re-approve the workflows? Thanks!

Hi @yiyixuxu @DN6 @sayakpaul, quick update: CI is now fully green.

    # ---------------------------------------------------------------------------
    # Down blocks
    # ---------------------------------------------------------------------------
    def _make_down_block_class(class_name: str, *, remove_self_attn: bool):

Rather than having _make_down_block_class, I think we should define the down block classes directly:

    # Version with both self attention and cross attention
    class DreamLiteAttnDownBlock2D(nn.Module):
        ...

    # Version with only cross attention
    class DreamLiteCrossAttnDownBlock2D(nn.Module):
        ...

since this would be more clear, at the cost of some duplicated code. (I also think the current names are hard to follow, especially CrossAttnUpRemoveSelfAttnBlock2DV1DreamLite.)

    # ---------------------------------------------------------------------------
    # Up blocks
    # ---------------------------------------------------------------------------
    def _make_up_block_class(class_name: str, *, remove_self_attn: bool):

Analogous comment to #13815 (comment): I think it would be better to define the two attention up block classes directly.

    # ---------------------------------------------------------------------------
    # Plain resnet-only blocks (no attention)
    # ---------------------------------------------------------------------------
    class DownBlock2DDreamLite(nn.Module):

Suggested change:

    - class DownBlock2DDreamLite(nn.Module):
    + class DreamLiteDownBlock2D(nn.Module):

nit: I think the above suggestion better follows the current diffusers model naming patterns.

            return hidden_states, output_states


    class UpBlock2DDreamLite(nn.Module):

Suggested change:

    - class UpBlock2DDreamLite(nn.Module):
    + class DreamLiteUpBlock2D(nn.Module):

Similar comment to #13815 (comment).

    from ..attention_processor import Attention, DreamLiteAttnProcessor2_0
    from ..normalization import RMSNorm
    from .unet_2d_blocks_dreamlite import (
        CrossAttnDownBlock2DDreamLite,
        CrossAttnDownRemoveSelfAttnBlock2DDreamLite,
        CrossAttnUpBlock2DDreamLite,
        CrossAttnUpRemoveSelfAttnBlock2DV1DreamLite,
        DownBlock2DDreamLite,
        UNetMidBlock2DCrossAttnDreamLite,
        UpBlock2DDreamLite,
    )

I think implementing all of the DreamLite model blocks in a single file (like how recent transformer models are implemented) would better follow the current model design. CC @yiyixuxu

    device: torch.device,
    dtype: torch.dtype,
    image: Optional[Image.Image] = None,
    max_sequence_length: int = 500,

It looks like max_sequence_length is currently unused in encode_prompt, is this intentional?

        text_pad_embedding: Optional[torch.Tensor] = None,
    ):
        if mode == "edit":
            drop_idx = 64

Can we document what drop_idx means here? Would it be possible to get this value from e.g. self.processor instead of hardcoding it?

            )

        elif mode == "generate":
            drop_idx = 34

    if num_images_per_prompt > 1:
        prompt_embeds = prompt_embeds.repeat_interleave(num_images_per_prompt, dim=0)
        text_attention_mask = text_attention_mask.repeat_interleave(num_images_per_prompt, dim=0)
    image_processed = self.image_processor.preprocess(image.resize((width, height), Image.Resampling.LANCZOS))

Suggested change:

    - image_processed = self.image_processor.preprocess(image.resize((width, height), Image.Resampling.LANCZOS))
    + image_processed = self.image_processor.preprocess(image, height=height, width=width)

Would the above suggestion work? VaeImageProcessor's default resample value is "lanczos", so I think we should be able to call preprocess normally instead of manually resizing the image first.

    noise_pred = noise_pred[..., : latents.shape[-1]]
    if task == "generate":
        noise_pred_uncond, noise_pred_cond = noise_pred.chunk(2)
        noise_pred = noise_pred_uncond + self._guidance_scale * (noise_pred_cond - noise_pred_uncond)

Suggested change:

    - noise_pred = noise_pred_uncond + self._guidance_scale * (noise_pred_cond - noise_pred_uncond)
    + noise_pred = noise_pred_uncond + self.guidance_scale * (noise_pred_cond - noise_pred_uncond)

nit: I think we should use the guidance_scale property here.
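A minimal sketch of the pattern this nit refers to: the value is stored on a private attribute at the start of `__call__` and read back through a read-only property everywhere else. `ToyPipeline` and its default scale are hypothetical, and scalars stand in for tensors.

```python
class ToyPipeline:
    """Hypothetical pipeline showing the private-attribute plus property pattern."""

    def __call__(self, guidance_scale=4.5):
        # Stored once per call; the rest of the code reads the property.
        self._guidance_scale = guidance_scale
        return self.combine(noise_pred_uncond=0.0, noise_pred_cond=1.0)

    @property
    def guidance_scale(self):
        return self._guidance_scale

    def combine(self, noise_pred_uncond, noise_pred_cond):
        # Classifier-free guidance, reading the scale via the public property.
        return noise_pred_uncond + self.guidance_scale * (noise_pred_cond - noise_pred_uncond)


result = ToyPipeline()(guidance_scale=2.0)
```

Reading through the property keeps a single public access point, so the denoising loop does not depend on the name of the private attribute.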

        + self._guidance_scale * (noise_pred_text - noise_pred_image)
        + self._image_guidance_scale * (noise_pred_image - noise_pred_uncond)
    )

Suggested change:

    + noise_pred = (
    +     noise_pred_uncond
    +     + self.guidance_scale * (noise_pred_text - noise_pred_image)
    +     + self.image_guidance_scale * (noise_pred_image - noise_pred_uncond)
    + )

nit: similar comment to #13815 (comment).
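The dual-CFG combination in this hunk can be checked in isolation. The sketch below copies the formula from the suggestion into a free function; the function name is hypothetical, and scalars stand in for the noise-prediction tensors.

```python
def dual_cfg(noise_pred_uncond, noise_pred_image, noise_pred_text,
             guidance_scale, image_guidance_scale):
    """Combine the three branch predictions with two guidance scales."""
    return (
        noise_pred_uncond
        + guidance_scale * (noise_pred_text - noise_pred_image)
        + image_guidance_scale * (noise_pred_image - noise_pred_uncond)
    )


uncond, image, text = 1.0, 2.0, 4.0

# With both scales at 1 the terms telescope to the text-conditioned prediction.
neutral = dual_cfg(uncond, image, text, guidance_scale=1.0, image_guidance_scale=1.0)

# Larger scales push further along the text and image directions.
pushed = dual_cfg(uncond, image, text, guidance_scale=3.0, image_guidance_scale=1.5)
```

Because the function only uses `+`, `-`, and `*`, the same code would also work on tensors of matching shape.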

    from ..stable_diffusion.pipeline_stable_diffusion_img2img import retrieve_latents
    from .pipeline_dreamlite import calculate_shift, retrieve_timesteps

Suggested change:

    - from ..stable_diffusion.pipeline_stable_diffusion_img2img import retrieve_latents
    - from .pipeline_dreamlite import calculate_shift, retrieve_timesteps

We prefer to copy helper functions like these and use the # Copied from mechanism to sync the implementations, similar to how retrieve_timesteps is implemented in src/diffusers/pipelines/dreamlite/pipeline_dreamlite.py.
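As an illustration of what the reviewer is asking for, a copied helper carries a `# Copied from` marker naming its source. The sketch below follows the shape of `retrieve_latents` from the Stable Diffusion img2img pipeline, with the torch type hints dropped so it runs without torch; that simplification is an assumption of this sketch, and a real copy would need to match its source exactly for the consistency check to pass.

```python
from types import SimpleNamespace


# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion_img2img.retrieve_latents
# (sketch only: type hints omitted here, so this is not a byte-for-byte copy)
def retrieve_latents(encoder_output, generator=None, sample_mode="sample"):
    if hasattr(encoder_output, "latent_dist") and sample_mode == "sample":
        return encoder_output.latent_dist.sample(generator)
    elif hasattr(encoder_output, "latent_dist") and sample_mode == "argmax":
        return encoder_output.latent_dist.mode()
    elif hasattr(encoder_output, "latents"):
        return encoder_output.latents
    else:
        raise AttributeError("Could not access latents of provided encoder_output")


# Toy encoder output exposing precomputed latents (hypothetical values).
latents = retrieve_latents(SimpleNamespace(latents=[0.1, 0.2]))
```

Keeping the helper local to the pipeline file avoids cross-pipeline imports, while the marker lets `make fix-copies` keep the copy in sync with its source.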

    @staticmethod
    def _extract_masked_hidden(hidden_states: torch.Tensor, mask: torch.Tensor) -> List[torch.Tensor]:

Suggested change:

    - @staticmethod
    - def _extract_masked_hidden(hidden_states: torch.Tensor, mask: torch.Tensor) -> List[torch.Tensor]:
    + @staticmethod
    + # Copied from diffusers.pipelines.dreamlite.pipeline_dreamlite.DreamLitePipeline._extract_masked_hidden
    + def _extract_masked_hidden(hidden_states: torch.Tensor, mask: torch.Tensor) -> List[torch.Tensor]:

We should use the # Copied from mechanism for all copied helper methods so that the implementations are synced.

dg845 left a comment:
Thanks for the PR! Left an initial design review :).


Context
This PR integrates DreamLite, ByteDance's text-to-image / image-edit diffusion model, into `diffusers`, following an invitation from @NielsRogge to release the model on the Hub in `diffusers` format.
Related issue: ByteVisionLab/DreamLite#3 (comment)
Model cards (public, ungated):
Both repos use a `diffusers` branch (loaded via `revision="diffusers"`) to keep the original ByteDance-internal `main` branch intact for backward compatibility with existing users.

What's added
Architecture highlights
* `DreamLiteUNetModel`: UNet-based denoiser conditioned on Qwen3-VL text/vision embeddings.
* `DreamLitePipeline`: runs 3 forward passes per step (text-cond / image-cond / uncond) and combines them with a dual-CFG schedule for high-fidelity text-to-image and image edit.
* `DreamLiteMobilePipeline`: distilled single-pass variant; no CFG; designed for on-device inference. Pairs with `AutoencoderTiny`.
* `FlowMatchEulerDiscreteScheduler`.

Testing
* `carlofkl/DreamLite-base` with `revision="diffusers"`: all 6 sub-modules resolve to the correct `diffusers.*` namespace.
* … std≈93, no NaN/Inf).
* … `tests/pipelines/dreamlite/`.

Before submitting
Who can review?
cc @sayakpaul @yiyixuxu @DN6, thanks in advance for the review!